AllLife Bank Project¶

Overview¶

AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing ones. Another insight from the market research is that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster. The Head of Marketing and the Head of Delivery have both decided to reach out to the Data Science team for help.

Objective¶

To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank, using clustering algorithms, and to provide recommendations to the bank on how to better market to and service these customers.

Goals¶

  • Identify trends
  • Construct appropriate, good-quality models
  • Provide suitable, actionable insights

Data Description¶

The data provided is of various customers of the bank and their financial attributes, like credit limit, the total number of credit cards the customer has, and the different channels through which customers have contacted the bank for any queries (including visiting the bank, online, and through a call center).

Data Dictionary¶

  • Sl_No: Primary key of the records
  • Customer Key: Customer identification number
  • Average Credit Limit: Average credit limit of each customer for all credit cards
  • Total credit cards: Total number of credit cards possessed by the customer
  • Total visits bank: Total number of visits that customer made (yearly) personally to the bank
  • Total visits online: Total number of visits or online logins made by the customer (yearly)
  • Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)

Initial Setup¶

Libraries¶

In [116]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
!pip install pandas-profiling==1.1.0
import pandas_profiling

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

import seaborn as sns

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.datasets import make_blobs


# to visualize the elbow curve and silhouette scores
!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer


# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# to compute distances
from scipy.spatial.distance import pdist

# to perform PCA
from sklearn.decomposition import PCA

# Pandas dataframe options
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_colwidth',400)


# set the background for the graphs
plt.style.use('ggplot')

# For pandas profiling
from pandas_profiling import ProfileReport

# Printing style
from tabulate import tabulate

# Library to suppress warnings or deprecation notes 
import warnings
warnings.filterwarnings('ignore')
In [117]:
!pip install openpyxl

Loading the dataset¶

In [118]:
df = pd.read_excel('Credit+Card+Customer+Data.xlsx')
In [119]:
data = df.copy()

Dataset Summary¶

Shape of the data¶

In [120]:
print('There are {row} records and {col} columns in the dataset'.format(row=data.shape[0], col=data.shape[1]))
There are 660 records and 7 columns in the dataset

Missing Values¶

In [121]:
data.isnull().sum()
Out[121]:
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64

There are no missing values in the dataset

Duplicate data¶

In [122]:
data.duplicated().sum()
Out[122]:
0

There are no duplicate records

Sample records¶

In [123]:
data.sample(10)
Out[123]:
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
63 64 91673 14000 3 0 4 6
76 77 90168 17000 1 2 4 6
241 242 81878 10000 4 5 1 3
412 413 92140 31000 6 5 2 1
609 610 11562 38000 4 3 2 0
322 323 23881 8000 6 3 1 2
554 555 97850 33000 5 2 2 0
617 618 98216 136000 8 0 13 0
560 561 35268 46000 6 3 1 0
546 547 44157 68000 7 2 1 4

Data Types¶

In [124]:
data.dtypes
Out[124]:
Sl_No                  int64
Customer Key           int64
Avg_Credit_Limit       int64
Total_Credit_Cards     int64
Total_visits_bank      int64
Total_visits_online    int64
Total_calls_made       int64
dtype: object

All the attributes are integers

Checking keys¶

There is only one possible key attribute, the Customer Key. We'll first convert Sl_No to the index, and then check the Customer Key. Since this data is to be used for customer segmentation, finding a valid customer key is essential.

I have already checked this in Excel, hence depicting the same in the notebook. Considering the small size of the dataset, it is quite convenient to check a few initial things in the Excel file itself first.

In [125]:
data = data.set_index(['Sl_No'])
In [126]:
data.index.max()
Out[126]:
660
In [127]:
data['Customer Key'].nunique()
Out[127]:
655

There appear to be 5 duplicate values in the Customer Key

Analyzing the records with duplicate customer keys¶

In [128]:
dup_keys = data.loc[data['Customer Key'].duplicated(keep=False), 'Customer Key']
data[data['Customer Key'].isin(dup_keys)].sort_values('Customer Key')
Out[128]:
Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
Sl_No
49 37252 6000 4 0 2 8
433 37252 59000 6 2 1 2
5 47437 100000 6 0 12 3
333 47437 17000 7 3 1 0
412 50706 44000 4 5 0 2
542 50706 60000 7 5 2 2
392 96929 13000 4 5 0 0
399 96929 67000 6 2 2 2
105 97935 17000 2 1 2 10
633 97935 187000 7 1 7 0

The records for the same Customer Key look very different from each other. I am assuming this is either a mistake in the Customer Key assignment, or we are missing a current-version indicator in the dataset. For now, I am going to treat these as different customers. After the clustering, I will analyze the groups corresponding to these sets of records.

Standardizing Column Names¶

In [129]:
def feature_name_standardize(df: pd.DataFrame):
    df_ = df.copy()
    df_.columns = [i.replace(" ", "_").lower() for i in df_.columns]
    return df_
In [130]:
data = feature_name_standardize(data)

Column Statistics¶

In [131]:
data.describe().T
Out[131]:
count mean std min 25% 50% 75% max
customer_key 660.000 55141.444 25627.772 11265.000 33825.250 53874.500 77202.500 99843.000
avg_credit_limit 660.000 34574.242 37625.488 3000.000 10000.000 18000.000 48000.000 200000.000
total_credit_cards 660.000 4.706 2.168 1.000 3.000 5.000 6.000 10.000
total_visits_bank 660.000 2.403 1.632 0.000 1.000 2.000 4.000 5.000
total_visits_online 660.000 2.606 2.936 0.000 1.000 2.000 4.000 15.000
total_calls_made 660.000 3.583 2.865 0.000 1.000 3.000 5.000 10.000

Exploratory Data Analysis¶

Univariate Analysis¶

The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative density plot. For these four kinds of plots, we build the summary() function below, which plots each numerical attribute and also displays a feature-wise 5-point summary.

In [132]:
def summary(data: pd.DataFrame, x: str):
    '''
    The function prints the 5 point summary and histogram, box plot, 
    violin plot, and cumulative density distribution plots for each 
    feature name passed as the argument.
    
    Parameters:
    ----------
    
    data: pd.DataFrame, the dataset
    x: str, feature name
    
    Usage:
    ------------
    
    summary(data, 'age')
    '''
    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)
    
    summary_dict = {'Min': x_min, 'Q1': Q1, 'Q2': Q2, 'Q3': Q3, 'Max': x_max}
    df = pd.DataFrame(data=summary_dict, index=['Value'])
    print(f'5 Point Summary of {x.capitalize()} Attribute:\n')
    print(tabulate(df, headers = 'keys', tablefmt = 'psql'))

    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace = 0.6)
    sns.set_palette('Pastel1')
    
    plt.subplot(221, frameon=True)
    ax1 = sns.histplot(data[x], kde=True, stat='density', color='purple')  # histplot replaces the deprecated distplot
    ax1.axvline(
        np.mean(data[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(data[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f'{x.capitalize()} Density Distribution')
    
    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x = data[x], palette = 'Accent', split = True)
    plt.title(f'{x.capitalize()} Violinplot')
    
    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(x=data[x], palette = 'cool', width=0.7, linewidth=0.6, showmeans=True)
    plt.title(f'{x.capitalize()} Boxplot')
    
    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(data[x], cumulative=True)
    plt.title(f'{x.capitalize()} Cumulative Density Distribution')
    
    plt.show()
In [133]:
print('We will check the summary of below columns: \n', data.columns.tolist())
We will check the summary of below columns: 
 ['customer_key', 'avg_credit_limit', 'total_credit_cards', 'total_visits_bank', 'total_visits_online', 'total_calls_made']

Summary of Average Credit Limit¶

In [134]:
summary(data, 'avg_credit_limit')
5 Point Summary of Avg_credit_limit Attribute:

+-------+-------+-------+-------+-------+--------+
|       |   Min |    Q1 |    Q2 |    Q3 |    Max |
|-------+-------+-------+-------+-------+--------|
| Value |  3000 | 10000 | 18000 | 48000 | 200000 |
+-------+-------+-------+-------+-------+--------+

The attribute is right skewed with a lot of outliers

Summary of Total Number of Credit Cards¶

In [135]:
summary(data, 'total_credit_cards')
5 Point Summary of Total_credit_cards Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     1 |    3 |    5 |    6 |    10 |
+-------+-------+------+------+------+-------+

The attribute is fairly normally distributed with a few spikes

Summary of Total Number of Visits to the Bank¶

In [136]:
summary(data, 'total_visits_bank')
5 Point Summary of Total_visits_bank Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |     5 |
+-------+-------+------+------+------+-------+

The data is slightly right skewed

Summary of Total Online Visits¶

In [137]:
summary(data, 'total_visits_online')
5 Point Summary of Total_visits_online Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |    15 |
+-------+-------+------+------+------+-------+

The data is right skewed and has some outliers to the right

Summary of Total Calls Made to the Bank¶

In [138]:
summary(data, 'total_calls_made')
5 Point Summary of Total_calls_made Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    3 |    5 |    10 |
+-------+-------+------+------+------+-------+

The data is right skewed

Labeled Bar-plots¶

Creating a function to plot labeled bar plots of the features, with percentage labels on the bars.

In [139]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 3))
    else:
        plt.figure(figsize=(n + 1, 3))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Creating a credit-limit bin (cc_spending_bin) out of the available data in the avg_credit_limit feature

In [140]:
data['cc_spending_bin'] = pd.cut(data['avg_credit_limit'],
                                 bins=[3000, 10000, 18000, 48000, 200000],
                                 labels=['Very Low', 'Low', 'Mid', 'High'],
                                 include_lowest=True)

Credit Card Limit Bins¶

In [141]:
labeled_barplot(data, 'cc_spending_bin', perc=True)

Total Credit Cards¶

In [142]:
labeled_barplot(data, 'total_credit_cards', perc=True)

Total Visits to the Bank¶

In [143]:
labeled_barplot(data, 'total_visits_bank', perc=True)

Total Online Visits¶

In [144]:
labeled_barplot(data, 'total_visits_online', perc=True)

Total Calls Made¶

In [145]:
labeled_barplot(data, 'total_calls_made', perc=True)

Bi-variate Analysis¶

Pair Plot¶

In [146]:
sns.set(palette="Set2")
sns.pairplot(data.iloc[:, 1:], diag_kind='kde', corner=True);

Heatmap¶

In [147]:
# Plotting correlation heatmap of the features

mask = np.zeros_like(data.iloc[:, 1:].corr(), dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True


sns.set(rc={"figure.figsize": (8, 8)})

sns.heatmap(
    data.iloc[:, 1:].corr(),
    cmap=sns.diverging_palette(20, 220, n=200),
    annot=True,
    mask=mask,
    center=0,
)
plt.show()
  • Average Credit Limit has a slightly positive correlation with Total Credit Cards (as expected) and Total Visits Online, and a slightly negative correlation with Total Calls Made
  • Total Credit Cards and Total Calls Made are negatively correlated
  • Total Visits Online is also negatively correlated with Total Visits to the Bank

Average Credit Limit distribution by Each of the Other Attributes¶

In [148]:
# Function to plot numerical feature by each category with target hue

def plot_numeric_by_cat(data: pd.DataFrame, category_columns: list, numeric_column: str, hue: str = None):
    '''
    The function plots a numerical feature in box plot by every category column specified in the list, 
    with hue of a target category
    '''
    num_cols = 2
    num_rows = (len(category_columns) + num_cols - 1) // num_cols  # ceiling division, avoids empty rows

    plt.figure(figsize=(20, 8*num_rows))
    for i, col in enumerate(category_columns):
        plt.subplot(num_rows, num_cols, i+1)
        sns.set(palette="nipy_spectral");
        sns.boxplot(data=data, x=col, y=numeric_column, hue=hue, showfliers=True).set(title = numeric_column + ' vs. ' + col );
In [149]:
plot_numeric_by_cat(data,
                    category_columns=['total_credit_cards', 'total_visits_bank', 'total_visits_online', 'total_calls_made'],
                    numeric_column='avg_credit_limit')

We can see clear segmentation with respect to each pair of features

Data Preprocessing¶

Before clustering, we should always scale the data; otherwise, features on larger scales would be given unintended importance when the distances are calculated.
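As a quick sanity check (toy numbers, not taken from the dataset), the sketch below shows how, without z-scaling, the credit-limit axis dominates the squared Euclidean distance between two customers, and how scaling rebalances the contributions:

```python
# Toy illustration of why we scale before distance-based clustering.
# The two hypothetical customers differ slightly in credit limit but a lot
# in bank visits; in raw units the credit-limit term still dominates.
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: avg_credit_limit, total_visits_bank (made-up values)
X = np.array([
    [10000.0, 0.0],
    [10500.0, 5.0],
    [40000.0, 0.0],
])

diff_raw = X[0] - X[1]
raw_share = diff_raw[0] ** 2 / (diff_raw ** 2).sum()
print(f"credit-limit share of squared distance (raw):    {raw_share:.4f}")

X_scaled = StandardScaler().fit_transform(X)
diff_scaled = X_scaled[0] - X_scaled[1]
scaled_share = diff_scaled[0] ** 2 / (diff_scaled ** 2).sum()
print(f"credit-limit share of squared distance (scaled): {scaled_share:.4f}")
```

After scaling, the visit-count difference, not the credit-limit difference, carries most of the distance, which is what we want for clustering.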

In [150]:
# scaling the dataset before clustering
data_copy = data.copy()
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.iloc[:,1:-1])

K-means Clustering¶

Finding the best number of centroids (k)¶

Elbow Curve to get the right number of Clusters¶

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.

In [151]:
clusters = range(1, 10)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)  # fixed seed for reproducibility
    model.fit(data_scaled)
    prediction = model.predict(data_scaled)
    distortion = (
        sum(
            np.min(cdist(data_scaled, model.cluster_centers_, "euclidean"), axis=1)
        )
        / data_scaled.shape[0]
    )

    meanDistortions.append(distortion)


plt.plot(clusters, meanDistortions, "bo-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
for x,y in zip(clusters, meanDistortions):

    label = "{:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x + 0.2,y), # these are the coordinates to position the label
                 textcoords="offset points", # how to position the text
                 xytext=(0,10), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

The appropriate value for k seems to be 3

Let's check the silhouette scores.

Silhouette Scores¶

  • A score closer to 1 indicates that the data point is very similar to other data points in the cluster,

  • A score closer to -1 indicates that the data point is not similar to the data points in its cluster.
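To build intuition (synthetic blobs, not the bank data), here is a minimal sketch using the make_blobs helper imported earlier: with three well-separated groups, the silhouette score peaks when k matches the true number of clusters:

```python
# Silhouette score on synthetic data: three tight blobs, so k = 3 should
# score highest, while merging (k = 2) or splitting (k = 4) lowers it.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=1)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```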

In [152]:
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)  # fixed seed for reproducibility
    preds = clusterer.fit_predict((data_scaled))
    # centers = clusterer.cluster_centers_
    score = silhouette_score(data_scaled, preds)
    sil_score.append(score)
    

plt.plot(cluster_list, sil_score, "bo-")
plt.xlabel("k")
plt.ylabel("Silhouette Score ")
plt.title("Selecting k using Silhouette Score")
for x,y in zip(cluster_list, sil_score):

    label = "{:.2f}".format(y)

    plt.annotate(label, # this is the text
                 (x + 0.2,y), # these are the coordinates to position the label
                 textcoords="offset points", # how to position the text
                 xytext=(0,10), # distance from text to points (x,y)
                 ha='center') # horizontal alignment can be left, right or center
plt.show()

The silhouette score is highest for 3 clusters, so we will choose k = 3.

Let's also visualize the silhouettes formed by the clusters for two values of k: 3 and 4

Visualize the Silhouettes¶

In [153]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(data_scaled)
visualizer.show()
Out[153]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [154]:
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(data_scaled)
visualizer.show()
Out[154]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Clearly, 3 clusters seem very reasonable for this case study

Build the model with 3 centroids¶

In [155]:
# let's take 3 as number of clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(data_scaled)
Out[155]:
KMeans(n_clusters=3, random_state=0)

Add the cluster numbers as a new attribute in the dataset¶

In [156]:
# adding kmeans cluster labels to the original and scaled dataframes
data_scaled_df = pd.DataFrame(data_scaled, columns=data.iloc[:,1:-1].columns.unique().tolist())
data['cluster'] = kmeans.labels_
data_scaled_df['cluster'] = kmeans.labels_

Customer Profiling - Visualize the Clusters with Features¶

In [157]:
sns.pairplot(data_scaled_df, diag_kind='kde', corner=True, hue='cluster', palette='Set2');
  • It appears that the channel customers use to contact the bank (in person / online / call) predominantly drives the clustering; we'll explore this in the plot below
In [158]:
import plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter3d(
    x= data['total_visits_bank'],
    y= data['total_visits_online'],
    z= data['total_calls_made'],
    mode='markers',
     marker=dict(
        color = data['cluster'], 
        size= 20,
        line=dict(
            color= data['cluster'],
            width= 12
        ),
        opacity=0.8
     ),

)
d = [trace1]
layout = go.Layout(
    title= 'Clusters',
    scene = dict(
            xaxis = dict(title  = 'Bank Visits'),
            yaxis = dict(title  = 'Online Visits'),
            zaxis = dict(title  = 'Call Customer Service')
        )
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
In [159]:
sns.set(style='dark')
sns.scatterplot(x=data['avg_credit_limit'], y=data['total_credit_cards'], hue=data['cluster'], palette=['blue', 'red', 'orange'])
Out[159]:
<AxesSubplot:xlabel='avg_credit_limit', ylabel='total_credit_cards'>
In [160]:
cluster_profile = data.iloc[:,1:].groupby('cluster').mean()
cluster_profile['count_of_customers'] = data.groupby('cluster')['customer_key'].count()

cluster_profile.style.highlight_max(color='lightgreen').highlight_min(color='pink')
Out[160]:
  avg_credit_limit total_credit_cards total_visits_bank total_visits_online total_calls_made count_of_customers
cluster            
0 33782.383420 5.515544 3.489637 0.981865 2.000000 386
1 12174.107143 2.410714 0.933036 3.553571 6.870536 224
2 141040.000000 8.740000 0.600000 10.900000 1.080000 50

Analyzing the segments using Box Plot¶

In [161]:
data_scaled_df.boxplot(by='cluster', layout=(3,2), figsize=(10,14));
  • There is a group that prefers online interactions with the bank; these customers have a much higher credit limit and also hold more credit cards (cluster 2).
  • The customers who prefer in-person interactions have a mid-range number of credit cards and credit limit (cluster 0).
  • The customers who contact the bank mainly by phone form another segment, with the lowest credit limits and fewest cards (cluster 1).
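As a small reporting sketch (the segment names below are my own shorthand for the profiles above, not part of the dataset), the cluster labels can be mapped to readable names:

```python
# Hypothetical cluster-to-segment mapping, following the profiling above.
import pandas as pd

df = pd.DataFrame({"cluster": [0, 1, 2, 0]})  # toy label column
segment_names = {
    0: "In-person / mid credit limit",
    1: "Phone-first / low credit limit",
    2: "Online-first / high credit limit",
}
df["segment"] = df["cluster"].map(segment_names)
print(df)
```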

Hierarchical Clustering¶

Before clustering, we'll remove the k-means cluster column from the dataset.

In [162]:
k_means_clusters = data_scaled_df['cluster']
data_scaled_df.drop(columns=['cluster'], inplace=True)

I am going to try many distance metrics and linkage methods to find the best combination.

Cophenetic Correlations¶

The cophenetic correlation measures how faithfully the dendrogram preserves the pairwise distances between the original observations. The value should be very close to 1 for a high-quality solution, and it can be used to compare alternative cluster solutions obtained using different distance metrics and linkage methods.
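A minimal sketch (toy two-clump data, not the bank dataset) of what the cophenetic correlation captures: it correlates the original pairwise distances with the merge heights implied by the dendrogram, so a tree that faithfully preserves the geometry scores close to 1:

```python
# Cophenetic correlation on two tight, well-separated groups: the
# hierarchy preserves the pairwise distances well, so c is near 1.
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(5, 0.1, (10, 2))])

Z = linkage(X, metric="euclidean", method="average")
c, _ = cophenet(Z, pdist(X))
print(round(c, 3))
```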

In [163]:
# list of distance metrics
distance_metrics = ['braycurtis', 'canberra', 'chebyshev', 'cityblock', 'correlation',
                    'cosine', 'euclidean', 'hamming', 'jaccard', 'mahalanobis',
                    'matching', 'minkowski', 'seuclidean', 'sqeuclidean']

# list of linkage methods
linkage_methods = ['complete', 'average', 'single', 'weighted']

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_scaled_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(data_scaled_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Braycurtis distance and complete linkage is 0.5082233975395475.
Cophenetic correlation for Braycurtis distance and average linkage is 0.6052152783444972.
Cophenetic correlation for Braycurtis distance and single linkage is 0.8502845518924395.
Cophenetic correlation for Braycurtis distance and weighted linkage is 0.531818336516011.
Cophenetic correlation for Canberra distance and complete linkage is 0.46069443891594597.
Cophenetic correlation for Canberra distance and average linkage is 0.6751521558776901.
Cophenetic correlation for Canberra distance and single linkage is 0.7566178713003985.
Cophenetic correlation for Canberra distance and weighted linkage is 0.5351232761249246.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.8533474836336782.
Cophenetic correlation for Chebyshev distance and average linkage is 0.8974159511838106.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7382354769296767.
Cophenetic correlation for Chebyshev distance and weighted linkage is 0.8913624010768603.
Cophenetic correlation for Cityblock distance and complete linkage is 0.8731477899179829.
Cophenetic correlation for Cityblock distance and average linkage is 0.896329431104133.
Cophenetic correlation for Cityblock distance and single linkage is 0.7252379350252723.
Cophenetic correlation for Cityblock distance and weighted linkage is 0.8825520731498188.
Cophenetic correlation for Correlation distance and complete linkage is 0.48880336464362406.
Cophenetic correlation for Correlation distance and average linkage is 0.6601199957880307.
Cophenetic correlation for Correlation distance and single linkage is 0.6102702201143325.
Cophenetic correlation for Correlation distance and weighted linkage is 0.5317441799993322.
Cophenetic correlation for Cosine distance and complete linkage is 0.5316900096793689.
Cophenetic correlation for Cosine distance and average linkage is 0.6720803242310263.
Cophenetic correlation for Cosine distance and single linkage is 0.7473448481710088.
Cophenetic correlation for Cosine distance and weighted linkage is 0.5660201015465876.
Cophenetic correlation for Euclidean distance and complete linkage is 0.8599730607972423.
Cophenetic correlation for Euclidean distance and average linkage is 0.8977080867389372.
Cophenetic correlation for Euclidean distance and single linkage is 0.7391220243806552.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8861746814895477.
Cophenetic correlation for Hamming distance and complete linkage is 0.20179683067025023.
Cophenetic correlation for Hamming distance and average linkage is 0.5473662896769675.
Cophenetic correlation for Hamming distance and single linkage is 0.7080851663862526.
Cophenetic correlation for Hamming distance and weighted linkage is 0.6057204454928228.
Cophenetic correlation for Jaccard distance and complete linkage is 0.20179683067025023.
Cophenetic correlation for Jaccard distance and average linkage is 0.5473662896769675.
Cophenetic correlation for Jaccard distance and single linkage is 0.7080851663862526.
Cophenetic correlation for Jaccard distance and weighted linkage is 0.6057204454928228.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.5422791209801747.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.8326994115042134.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.7058064784553606.
Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.7805990615142516.
Cophenetic correlation for Matching distance and complete linkage is 0.20179683067025023.
Cophenetic correlation for Matching distance and average linkage is 0.5473662896769675.
Cophenetic correlation for Matching distance and single linkage is 0.7080851663862526.
Cophenetic correlation for Matching distance and weighted linkage is 0.6057204454928228.
Cophenetic correlation for Minkowski distance and complete linkage is 0.8599730607972423.
Cophenetic correlation for Minkowski distance and average linkage is 0.8977080867389372.
Cophenetic correlation for Minkowski distance and single linkage is 0.7391220243806552.
Cophenetic correlation for Minkowski distance and weighted linkage is 0.8861746814895477.
Cophenetic correlation for Seuclidean distance and complete linkage is 0.8599730607972426.
Cophenetic correlation for Seuclidean distance and average linkage is 0.8977080867389373.
Cophenetic correlation for Seuclidean distance and single linkage is 0.7391220243806551.
Cophenetic correlation for Seuclidean distance and weighted linkage is 0.8861746814895477.
Cophenetic correlation for Sqeuclidean distance and complete linkage is 0.8820964814996479.
Cophenetic correlation for Sqeuclidean distance and average linkage is 0.8783309583061251.
Cophenetic correlation for Sqeuclidean distance and single linkage is 0.7284473846766318.
Cophenetic correlation for Sqeuclidean distance and weighted linkage is 0.893679734763713.
In [164]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1]
    )
)
Highest cophenetic correlation is 0.8977080867389373, which is obtained with average linkage.

We see that the cophenetic correlation is highest with Euclidean distance and average linkage.
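The grid of cophenetic correlations printed above can be reproduced with a nested loop over distance metrics and linkage methods. A minimal sketch (the metric list is truncated for illustration, and a small synthetic dataset stands in for `data_scaled_df`):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# synthetic stand-in for the scaled customer data
rng = np.random.default_rng(0)
data_scaled_df = pd.DataFrame(rng.standard_normal((60, 5)))

distance_metrics = ["euclidean", "cityblock", "cosine"]  # truncated for illustration
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0.0
high_dm_lm = [None, None]  # [distance metric, linkage method]

for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_scaled_df, metric=dm, method=lm)
        # correlate the cophenetic distances with the original pairwise distances
        c, _ = cophenet(Z, pdist(data_scaled_df, metric=dm))
        print(f"Cophenetic correlation for {dm.capitalize()} distance and {lm} linkage is {c}.")
        if c > high_cophenet_corr:
            high_cophenet_corr = c
            high_dm_lm = [dm, lm]
```

The best (metric, linkage) pair is tracked in `high_dm_lm`, matching the variables printed in the cell below.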

Dendrograms¶

A dendrogram is a tree diagram that records the sequence of merges in hierarchical clustering. We will use a dendrogram to figure out the best way to allocate observations to clusters.

In [165]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_scaled_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(data_scaled_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
  • The cophenetic correlation is highest for the average linkage method.
  • 3 appears to be the appropriate number of clusters from the dendrogram for average linkage.
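Once a cut point is chosen from the dendrogram, `scipy.cluster.hierarchy.fcluster` can turn the average-linkage tree into flat cluster labels. A sketch with a small synthetic three-blob dataset standing in for `data_scaled_df`:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic stand-in: three well-separated blobs of 30 points each
rng = np.random.default_rng(42)
data_scaled_df = pd.DataFrame(
    np.vstack([rng.normal(center, 0.3, size=(30, 5)) for center in (0, 3, 6)])
)

Z = linkage(data_scaled_df, metric="euclidean", method="average")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into exactly 3 flat clusters
print(np.unique(labels))  # -> [1 2 3]
```

This gives the same partition that `AgglomerativeClustering` produces below, but directly from the linkage matrix used for the dendrogram.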

Build Agglomerative Clustering model¶

Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). Bottom-up algorithms treat each data point as a singleton cluster at the outset and then successively agglomerate pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
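The merge sequence is easy to see on a toy example: each row of the linkage matrix returned by `scipy.cluster.hierarchy.linkage` records one agglomeration step. A sketch (not part of the project code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# four 1-D points, each starting as its own singleton cluster
points = np.array([[0.0], [0.1], [1.0], [5.0]])
Z = linkage(points, method="average")

# each row: indices of the two merged clusters, their distance, and the new cluster size
for step, (i, j, dist, size) in enumerate(Z):
    print(f"step {step}: merge clusters {int(i)} and {int(j)} "
          f"(distance {dist:.2f}, new size {int(size)})")
```

The closest pair (0.0 and 0.1) merges first, and merging continues until one cluster holds all four points, exactly the bottom-up process described above.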

Build Model¶

In [166]:
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="average")
HCmodel.fit(data_scaled_df)
Out[166]:
AgglomerativeClustering(linkage='average', n_clusters=3)

Assign cluster labels¶

In [167]:
data_scaled_df['hc_cluster'] = HCmodel.labels_
data['hc_cluster'] = HCmodel.labels_

Cluster profile¶

In [168]:
cluster_profile_hc = data.iloc[:,1:].groupby('hc_cluster').mean()
cluster_profile_hc['count_of_customers'] = data.groupby('hc_cluster')['customer_key'].count()

cluster_profile_hc.style.highlight_max(color='lightgreen').highlight_min(color='pink')
Out[168]:
| hc_cluster | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | cluster | count_of_customers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 33713.178295 | 5.511628 | 3.485788 | 0.984496 | 2.005168 | 0.002584 | 387 |
| 1 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 2.000000 | 50 |
| 2 | 12197.309417 | 2.403587 | 0.928251 | 3.560538 | 6.883408 | 1.000000 | 223 |

Analyzing the segments using Box Plot¶

In [169]:
data_scaled_df.boxplot(by='hc_cluster', layout=(3,2), figsize=(10,14));
  • There is a group that prefers online interactions with the bank; these customers have a much higher credit limit and also hold more credit cards (cluster 1).
  • The customers who prefer in-person interactions tend to have a mid-range number of credit cards and a mid-range credit limit (cluster 0).
  • The customers who contact the bank via phone calls form another segment, with the lowest credit limit and the fewest cards (cluster 2).
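A quick way to sanity-check how well separated the three clusters are (not done in the original notebook) is the silhouette score, which ranges from -1 to 1 and approaches 1 for tight, well-separated clusters. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# synthetic stand-in for the scaled data: three well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(center, 0.4, size=(40, 5)) for center in (0, 3, 6)])

hc = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = hc.fit_predict(X)

# higher silhouette means tighter, better-separated clusters
print(f"silhouette score: {silhouette_score(X, labels):.3f}")
```

Applying the same call to `data_scaled_df` and `HCmodel.labels_` would quantify the separation seen in the box plots.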

Checking the clusters for the duplicated customer keys¶

In [170]:
data[data['customer_key'].isin(data[data['customer_key'].duplicated()]['customer_key'].tolist())].sort_values('customer_key')
Out[170]:
| Sl_No | customer_key | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | cc_spending_bin | cluster | hc_cluster |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 49 | 37252 | 6000 | 4 | 0 | 2 | 8 | Very Low | 1 | 2 |
| 433 | 37252 | 59000 | 6 | 2 | 1 | 2 | High | 0 | 0 |
| 5 | 47437 | 100000 | 6 | 0 | 12 | 3 | High | 2 | 1 |
| 333 | 47437 | 17000 | 7 | 3 | 1 | 0 | Low | 0 | 0 |
| 412 | 50706 | 44000 | 4 | 5 | 0 | 2 | Mid | 0 | 0 |
| 542 | 50706 | 60000 | 7 | 5 | 2 | 2 | High | 0 | 0 |
| 392 | 96929 | 13000 | 4 | 5 | 0 | 0 | Low | 0 | 0 |
| 399 | 96929 | 67000 | 6 | 2 | 2 | 2 | High | 0 | 0 |
| 105 | 97935 | 17000 | 2 | 1 | 2 | 10 | Low | 1 | 2 |
| 633 | 97935 | 187000 | 7 | 1 | 7 | 0 | High | 2 | 1 |

If we consider the duplicate records to be updated records for the same customers, then 3 of the 5 customers have actually changed their clusters/groups. It appears that by increasing credit limits, or by converting customers to digital banking, we can move customers into a more desirable and profitable cluster.

PCA for Dimensionality Reduction¶

Although there are only 5 dimensions, it'll be really cool to be able to visualize the clusters in 3-dimensional space without losing much of the information. Let's use PCA to reduce the dimensions so that 90% of the variance in the data is explained.

In [190]:
data_scaled_df_pca = data_scaled_df.drop('hc_cluster', axis=1).copy()
In [191]:
pca = PCA()
pca.fit(data_scaled_df_pca)
Out[191]:
PCA()

Let's check the variance explained by individual components.

In [192]:
pca.explained_variance_ratio_
Out[192]:
array([0.45736578, 0.37426483, 0.06401348, 0.0555688 , 0.04878711])
In [193]:
# visualizing the variance explained by individual principal components
sns.set(style='darkgrid')
plt.figure(figsize=(10, 10))
plt.plot(
    range(1, 6), pca.explained_variance_ratio_.cumsum(), marker="o", linestyle="--"
)
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Out[193]:
Text(0, 0.5, 'Cumulative Explained Variance')

For 90% variance explained, the number of components looks to be 3.
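Rather than reading the component count off the cumulative-variance plot, `PCA` also accepts a float between 0 and 1 for `n_components` and keeps just enough components to reach that fraction of explained variance (this requires the `"full"` or `"auto"` SVD solver). A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in: five features with a few dominant variance directions
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ np.diag([3.0, 2.5, 1.0, 0.3, 0.2])

# keep the smallest number of components explaining at least 90% of the variance
pca = PCA(n_components=0.90, svd_solver="full")
pca.fit(X)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```

On the project data, `PCA(n_components=0.90, svd_solver="full")` would select the 3 components chosen manually below.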

In [194]:
pca = PCA(
    n_components=3, svd_solver="full"
)  # svd_solver="full" computes the exact LAPACK SVD; "randomized" is the faster choice for very large datasets
pca.fit(data_scaled_df_pca)
Out[194]:
PCA(n_components=3, svd_solver='full')
In [195]:
# checking the variance explained by individual components.
print('Explained variance = {var} %'.format(var=round(pca.explained_variance_ratio_.sum(),2)*100))
Explained variance = 90.0 %
In [196]:
plt.figure(figsize=(10, 10))
plt.plot(
    range(1, 4), pca.explained_variance_ratio_.cumsum(), marker="o", linestyle="--"
)
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
Out[196]:
Text(0, 0.5, 'Cumulative Explained Variance')
In [197]:
subset_pca = pca.transform(data_scaled_df_pca)
subset_pca_df = pd.DataFrame(subset_pca)
In [198]:
sns.pairplot(subset_pca_df, diag_kind='kde');

Hierarchical Clustering on lower-dimensional data¶

In [199]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]

# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]

# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))

# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(subset_pca_df, metric="euclidean", method=method)

    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")

    coph_corr, coph_dist = cophenet(Z, pdist(subset_pca_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )

Observations

  • The cophenetic correlation is highest for the average and centroid linkage methods, but I will check with complete linkage as it gives more distinct and separated clusters, with a cophenetic correlation of 0.89 (the highest being 0.90).
  • 3 appears to be the appropriate number of clusters from the dendrogram for complete linkage as well.
In [200]:
hc = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="complete")
hc_labels = hc.fit_predict(subset_pca)
In [201]:
# adding hierarchical cluster labels to the original dataframe
data_pca = data.drop(['cluster', 'hc_cluster'], axis=1)
data_pca['pca_hc_cluster'] = hc_labels
In [202]:
data_scaled_pca_df = data_scaled_df.drop('hc_cluster', axis=1)
data_scaled_pca_df['pca_hc_cluster'] = hc_labels

Cluster Profiling¶

In [203]:
cluster_profile2 = data_pca.groupby('pca_hc_cluster').mean()
cluster_profile2['customer_count'] = (
    data_pca.groupby('pca_hc_cluster')['customer_key'].count().values
)
In [204]:
# let's display cluster profile
cluster_profile2.style.highlight_max(color="lightgreen", axis=0)
Out[204]:
| pca_hc_cluster | customer_key | avg_credit_limit | total_credit_cards | total_visits_bank | total_visits_online | total_calls_made | customer_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 54951.709512 | 33591.259640 | 5.496144 | 3.467866 | 0.987147 | 2.002571 | 389 |
| 1 | 56708.760000 | 141040.000000 | 8.740000 | 0.600000 | 10.900000 | 1.080000 | 50 |
| 2 | 55120.814480 | 12217.194570 | 2.402715 | 0.936652 | 3.579186 | 6.932127 | 221 |
In [205]:
pca_df = subset_pca_df.copy()
pca_df['pca_hc_cluster'] = hc_labels

pca_df = pca_df.rename(columns={0:'component_1', 1:'component_2', 2:'component_3'})
In [206]:
sns.pairplot(pca_df, diag_kind='kde', hue='pca_hc_cluster');
In [207]:
import plotly as py
import plotly.graph_objs as go

trace1 = go.Scatter3d(
    x= pca_df['component_1'],
    y= pca_df['component_2'],
    z= pca_df['component_3'],
    mode='markers',
     marker=dict(
        color = pca_df['pca_hc_cluster'], 
        size= 20,
        line=dict(
            color= pca_df['pca_hc_cluster'],
            width= 12
        ),
        opacity=0.8
     ),

)
d = [trace1]
layout = go.Layout(
    title= 'Clusters',
    scene = dict(
            xaxis = dict(title  = 'Component 1'),
            yaxis = dict(title  = 'Component 2'),
            zaxis = dict(title  = 'Component 3')
        )
)
fig = go.Figure(data=d, layout=layout)
py.offline.iplot(fig)
In [208]:
data_scaled_pca_df.boxplot(by='pca_hc_cluster', layout=(3,2), figsize=(10,14));

Actionable insights and Recommendations¶

There are three significant types of customers:

  1. In-person users: prefer to handle bank transactions in person. They have a mid-range number of credit cards and mid-range available credit. They also make the most branch visits and form the largest segment.
  2. Phone users: prefer handling transactions verbally over the phone. They have the fewest credit cards and the lowest credit limits.
  3. Online users: prefer digital transactions. They have the most credit cards and the highest available credit. This is the smallest segment.

Suggestions¶

  • Customer channel preferences should guide how the bank contacts each segment.
  • Online users will most likely prefer contact through email.
  • Phone users will probably prefer contact through email/text and should also be called.
  • In-person users prefer mail notifications and in-branch updates.
  • Once the in-person and phone users are contacted, they should be marketed online banking, since there seems to be a correlation between online banking and spending power.
  • The online and in-person users should be targeted the most for banking promotions, since they have the most spending power.